Data Analysis Process¶

Asking Questions (Objectives)¶

  • Given the data, ask the questions that the analysis should answer

Data Wrangling¶

  • Gathering: collect the data needed to answer your questions
  • Assessing: explore the data, check data quality, and identify issues
  • Cleaning: fix issues by modifying, replacing, renaming, or removing problematic data

Perform EDA (Exploratory Data Analysis)¶

  • Explore Data using statistics and Visuals
  • Discover Data Pattern
  • Understand Data Distribution

Draw Conclusions¶

  • Summarize key findings

Communicate Results¶

  • Share the results

Instagram Dataset Field Description¶

  • Below is a description of the column fields in the dataset:

Core Fields¶

Impressions – total number of times the post was seen

From Home – impressions from followers’ home feed

From Hashtags – impressions from hashtags

From Explore – impressions from explore page

From Other – impressions from other sources (shares, profile, etc.)

Questions:¶

  • Which post got the most impressions?

  • Which post got the most likes?

  • Which post got the most saves?

  • What is the average number of impressions per post?

  • What is the average number of likes per post?

  • Which source gives the most impressions (Home, Hashtags, Explore)?

  • Do posts with more hashtags get more impressions?

  • Do longer captions get more engagement?

  • Which posts bring the most profile visits?

  • Which posts bring the most followers?

  • What is the correlation between impressions and likes?

  • What is the correlation between impressions and saves?

  • What is the total engagement for each post?

  • Which posts have the highest engagement rate?

  • Does the Explore page help increase impressions?
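Several of these questions reduce to one-line pandas aggregations. The sketch below is only an illustration: it uses the first three rows of the dataset, copied in by hand, and the variable names (`posts`, `top_by_metric`, `averages`, `source_totals`, `top_source`) are made up for the example.

```python
import pandas as pd

# First three rows of the dataset, typed in by hand for illustration only.
posts = pd.DataFrame({
    "Impressions": [3920, 5394, 4021],
    "From Home": [2586, 2727, 2085],
    "From Hashtags": [1028, 1838, 1188],
    "From Explore": [619, 1174, 0],
    "Likes": [162, 224, 131],
    "Saves": [98, 194, 41],
})

# Row label of the post with the most impressions, likes, and saves.
top_by_metric = posts[["Impressions", "Likes", "Saves"]].idxmax()

# Average impressions and likes per post.
averages = posts[["Impressions", "Likes"]].mean()

# Total impressions per source, and the source that contributes the most.
source_totals = posts[["From Home", "From Hashtags", "From Explore"]].sum()
top_source = source_totals.idxmax()
```

On the full DataFrame the same calls would apply unchanged, provided the column names match those listed above.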

In [174]:
# load needed modules 
import pandas as pd 
In [175]:
# display all the columns
pd.options.display.max_columns = None 
In [176]:
# load the dataset into a DataFrame
df = pd.read_csv(r'C:\Users\go\Downloads\.ipynb_checkpoints\Instgram.csv')
In [177]:
# display first three rows
df.head(3)
Out[177]:
Impressions From Home From Hashtags From Explore From Other Saves Comments Shares Likes Profile Visits Follows Caption Hashtags
0 3920 2586 1028 619 56 98 9 5 162 35 2 Here are some of the most important data visua... #finance #money #business #investing #investme...
1 5394 2727 1838 1174 78 194 7 14 224 48 10 Here are some of the best data science project... #healthcare #health #covid #data #datascience ...
2 4021 2085 1188 0 533 41 11 1 131 62 12 Learn how to train a machine learning model an... #data #datascience #dataanalysis #dataanalytic...
In [178]:
df['Caption'][0]
Out[178]:
'Here are some of the most important data visualizations that every Financial Data Analyst/Scientist should know.'
In [179]:
# print the data shape
df.shape
Out[179]:
(119, 13)
  • The data contains 119 rows (posts) and 13 columns (features)
In [180]:
# investigate data properties 
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Impressions     119 non-null    int64 
 1   From Home       119 non-null    int64 
 2   From Hashtags   119 non-null    int64 
 3   From Explore    119 non-null    int64 
 4   From Other      119 non-null    int64 
 5   Saves           119 non-null    int64 
 6   Comments        119 non-null    int64 
 7   Shares          119 non-null    int64 
 8   Likes           119 non-null    int64 
 9   Profile Visits  119 non-null    int64 
 10  Follows         119 non-null    int64 
 11  Caption         119 non-null    object
 12  Hashtags        119 non-null    object
dtypes: int64(11), object(2)
memory usage: 12.2+ KB
  • Completeness: All 13 columns have 119 non-null entries, matching the total number of rows (119), so there are no missing values in the dataset. This simplifies cleaning considerably.

  • Data Types: The data types are correctly assigned:

    Numerical Data (int64): All engagement and reach metrics (Impressions, From Home, Saves, Comments, Likes, etc.) are read as integers, which suits mathematical aggregation and statistical analysis.

    Text Data (object): Caption and Hashtags are read as object (string) types, which is what text analysis and feature engineering need.

In [181]:
# keep an untouched copy of the raw data
df_copy = df.copy()
In [182]:
#check for duplicates 
df.duplicated().sum()
Out[182]:
np.int64(17)
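  • The check reports 17 duplicated rows, so the data does contain duplicates. They are not dropped in this notebook, and all 119 rows are used in the analysis that follows.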
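When `duplicated().sum()` returns a non-zero count, it can help to look at which rows are flagged before deciding whether to drop them. The sketch below runs on a small made-up frame (`toy` is hypothetical, not the Instagram data) and is one possible follow-up, not a step the notebook performs.

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the first.
toy = pd.DataFrame({
    "Impressions": [3920, 5394, 3920],
    "Likes": [162, 224, 162],
})

dup_mask = toy.duplicated()          # True for rows that repeat an earlier row
n_duplicates = int(dup_mask.sum())   # number of repeated rows

# Keep the first occurrence of each row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)
```

Whether repeated rows are genuine duplicates or distinct posts with identical metrics is a judgment call that depends on how the data was collected.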
In [183]:
# check for missing value 
df.isnull().sum()
Out[183]:
Impressions       0
From Home         0
From Hashtags     0
From Explore      0
From Other        0
Saves             0
Comments          0
Shares            0
Likes             0
Profile Visits    0
Follows           0
Caption           0
Hashtags          0
dtype: int64
  • The data has no missing values

All columns in our file are needed, so we don't need to drop or rename any column¶

In [184]:
df.dtypes
Out[184]:
Impressions        int64
From Home          int64
From Hashtags      int64
From Explore       int64
From Other         int64
Saves              int64
Comments           int64
Shares             int64
Likes              int64
Profile Visits     int64
Follows            int64
Caption           object
Hashtags          object
dtype: object

Feature Engineering¶

  • Create new useful columns (e.g., engagement rate, growth rate, time-based features)
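As a rough sketch of what such derived columns can look like, the example below builds an engagement rate (interactions as a percentage of home-feed impressions, the definition this notebook uses) plus two extra features, caption length and hashtag count. The frame `toy` and the two extra columns are assumptions added for illustration; they are not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy posts; the captions and hashtag strings are made up.
toy = pd.DataFrame({
    "From Home": [2586, 2727],
    "Comments": [9, 7],
    "Shares": [5, 14],
    "Likes": [162, 224],
    "Caption": ["Short caption", "A somewhat longer caption"],
    "Hashtags": ["#finance #money", "#health #covid #data"],
})

# Interactions as a percentage of home-feed impressions.
interactions = toy["Comments"] + toy["Likes"] + toy["Shares"]
toy["engagement rate"] = interactions / toy["From Home"] * 100

# Extra illustrative features (assumed, not in the original notebook).
toy["caption length"] = toy["Caption"].str.len()      # characters per caption
toy["hashtag count"] = toy["Hashtags"].str.count("#")  # number of '#' symbols
```

Features like these would be one way to approach the questions about caption length and hashtag count listed earlier.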
In [185]:
# engagement rate (%): comments + likes + shares, relative to home-feed impressions
df['engagement rate'] = (df['Comments'] + df['Likes'] + df['Shares']) / df['From Home'] * 100
In [186]:
df.head(1)
Out[186]:
Impressions From Home From Hashtags From Explore From Other Saves Comments Shares Likes Profile Visits Follows Caption Hashtags engagement rate
0 3920 2586 1028 619 56 98 9 5 162 35 2 Here are some of the most important data visua... #finance #money #business #investing #investme... 6.805878
In [187]:
q_rate70 = df['engagement rate'].quantile(0.70)
q_rate70
Out[187]:
np.float64(8.496093232684412)
  • 70% of the posts have an engagement rate below about 8.5%
In [188]:
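# label posts above the 70th percentile as "High Rate"
# (the column is named 'reach class' but is derived from the engagement rate)
df['reach class'] = df['engagement rate'].apply(lambda x : "High Rate" if x > q_rate70 else "Low Rate")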
In [189]:
df.head(1)
Out[189]:
Impressions From Home From Hashtags From Explore From Other Saves Comments Shares Likes Profile Visits Follows Caption Hashtags engagement rate reach class
0 3920 2586 1028 619 56 98 9 5 162 35 2 Here are some of the most important data visua... #finance #money #business #investing #investme... 6.805878 Low Rate
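The percentile-based labelling can also be written without `apply`, using a vectorized comparison. This is a minimal sketch on made-up rates (`rates` is hypothetical), intended to show the mechanics of the 70th-percentile cut rather than to replace the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical engagement rates, in percent.
rates = pd.Series([5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0])

# 70th percentile (pandas interpolates linearly between data points by default).
threshold = rates.quantile(0.70)

# Values strictly above the threshold are "High Rate", the rest "Low Rate".
labels = np.where(rates > threshold, "High Rate", "Low Rate")
share_high = (labels == "High Rate").mean()
```

Because the comparison is strict, a post sitting exactly on the threshold is labelled "Low Rate", which matches the `x > q_rate70` condition used in the notebook.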
In [190]:
# convert the 'Hashtags' and 'Caption' columns to string
df['Hashtags'] = df['Hashtags'].astype(str)
df['Caption'] = df['Caption'].astype(str)
In [191]:
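# split each hashtag string into a list so individual tags can be accessed
# (splitting on '#' leaves an empty first element and trailing whitespace on each tag)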
df['Hashtags'] = df['Hashtags'].str.split('#')
In [192]:
df.head(1)
Out[192]:
Impressions From Home From Hashtags From Explore From Other Saves Comments Shares Likes Profile Visits Follows Caption Hashtags engagement rate reach class
0 3920 2586 1028 619 56 98 9 5 162 35 2 Here are some of the most important data visua... [, finance , money , business , investing , in... 6.805878 Low Rate
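Splitting on `'#'` keeps an empty string at the start of every list and leaves whitespace attached to each tag, as the output above shows. One alternative, sketched here on made-up strings, is to extract the tags with a regular expression. The series `raw` is hypothetical, and the `\xa0` (non-breaking space) separator is an assumption about how the tags are delimited in the file.

```python
import pandas as pd

# Hypothetical raw hashtag strings; "\xa0" is a non-breaking space.
raw = pd.Series(["#finance\xa0#money\xa0#business", "#python\xa0#data"])

# What a plain split on '#' produces for the first row, for comparison.
split_first = raw.str.split("#")[0]

# Capture only the word characters after each '#': no empty first
# element and no trailing whitespace on the tags.
tags = raw.str.findall(r"#(\w+)")
```

The pattern `\w+` would stop at characters such as hyphens, so tags containing them would need a wider character class.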
In [193]:
df['Hashtags'].dtype
Out[193]:
dtype('O')
In [194]:
# explode the hashtag lists (one row per post-hashtag pair) for analysis
df_hashtags= df.explode('Hashtags')
In [195]:
df_hashtags.shape
Out[195]:
(2375, 15)
In [196]:
df_hashtags.head()
Out[196]:
Impressions From Home From Hashtags From Explore From Other Saves Comments Shares Likes Profile Visits Follows Caption Hashtags engagement rate reach class
0 3920 2586 1028 619 56 98 9 5 162 35 2 Here are some of the most important data visua... 6.805878 Low Rate
0 3920 2586 1028 619 56 98 9 5 162 35 2 Here are some of the most important data visua... finance 6.805878 Low Rate
0 3920 2586 1028 619 56 98 9 5 162 35 2 Here are some of the most important data visua... money 6.805878 Low Rate
0 3920 2586 1028 619 56 98 9 5 162 35 2 Here are some of the most important data visua... business 6.805878 Low Rate
0 3920 2586 1028 619 56 98 9 5 162 35 2 Here are some of the most important data visua... investing 6.805878 Low Rate
In [197]:
high_rate_posts=df[df['reach class']=='High Rate']
In [198]:
high_rate_posts.head(1)
Out[198]:
Impressions From Home From Hashtags From Explore From Other Saves Comments Shares Likes Profile Visits Follows Caption Hashtags engagement rate reach class
1 5394 2727 1838 1174 78 194 7 14 224 48 10 Here are some of the best data science project... [, healthcare , health , covid , data , datasc... 8.984232 High Rate
In [199]:
# summary statistics of the High Rate posts
high_rate_posts[['Impressions','From Home','From Hashtags','From Explore','From Other','Follows']].describe()
Out[199]:
        Impressions    From Home  From Hashtags  From Explore  From Other     Follows
count     36.000000    36.000000      36.000000     36.000000   36.000000   36.000000
mean    7407.361111  2408.722222    3446.972222   1283.055556  182.083333   29.722222
std     3740.155002   590.821056    2508.970847   2291.808618  171.485339   47.425497
min     3630.000000  1711.000000     621.000000     36.000000   23.000000    0.000000
25%     4885.750000  2012.750000    1923.500000    285.500000   73.750000    6.000000
50%     6449.000000  2195.000000    2351.000000    552.500000  112.500000   13.000000
75%     9686.250000  2706.750000    4172.250000   1008.000000  227.000000   32.500000
max    17713.000000  4137.000000   11817.000000  12389.000000  794.000000  260.000000
In [200]:
# load needed modules 
from collections import Counter
In [201]:
df['Hashtags']
Out[201]:
0      [, finance , money , business , investing , in...
1      [, healthcare , health , covid , data , datasc...
2      [, data , datascience , dataanalysis , dataana...
3      [, python , pythonprogramming , pythonprojects...
4      [, datavisualization , datascience , data , da...
                             ...                        
114    [, datascience , datasciencejobs , datascience...
115    [, machinelearning , machinelearningalgorithms...
116    [, machinelearning , machinelearningalgorithms...
117    [, datascience , datasciencejobs , datascience...
118    [, python , pythonprogramming , pythonprojects...
Name: Hashtags, Length: 119, dtype: object
In [202]:
Hashtags = [y for x in high_rate_posts['Hashtags'] for y in x]
In [203]:
Hashtags_exp=df.explode('Hashtags')
In [222]:
top_5_hashtags = Counter(Hashtags).most_common(5)
In [223]:
# the 5 most common hashtags among High Rate posts
# (the '' entry is an artifact of splitting on '#')
top_5_hashtags
Out[223]:
[('', 36),
 ('amankharwal\xa0', 36),
 ('python\xa0', 34),
 ('datascience\xa0', 33),
 ('dataanalytics\xa0', 32)]
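The empty string and the trailing `\xa0` in these counts come from the raw split, so the same tag with and without trailing whitespace would be counted separately. A minimal sketch of cleaning the tags before counting, on hypothetical lists shaped like the split output (`tag_lists` is made up):

```python
from collections import Counter

# Hypothetical tag lists shaped like the output of str.split('#').
tag_lists = [
    ["", "python\xa0", "data\xa0", "ai"],
    ["", "python\xa0", "data"],
    ["", "python"],
]

# str.strip() removes surrounding whitespace, including "\xa0";
# empty strings are skipped so they do not appear in the counts.
cleaned = (tag.strip() for tags in tag_lists for tag in tags)
counts = Counter(tag for tag in cleaned if tag)
top_2 = counts.most_common(2)
```

Here `python\xa0` and `python` end up in the same bucket, which is usually what a hashtag frequency table is meant to show.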
In [206]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
In [207]:
#relationship between engagement rate and Impressions
sns.scatterplot(data=df,x='Impressions',y='engagement rate')
plt.title('Impressions vs engagement rate ')
plt.xlabel('Impressions')
plt.ylabel('engagement rate')
plt.show()
In [208]:
sns.regplot(data=df,x='Impressions',y='engagement rate')
plt.title('Impressions vs engagement rate ')
plt.xlabel('Impressions')
plt.ylabel('engagement rate')
plt.show()
In [209]:
df.plot(x='engagement rate',y='Impressions' , kind='scatter')
Out[209]:
<Axes: xlabel='engagement rate', ylabel='Impressions'>
In [210]:
sns.displot(df['engagement rate'],kde=True)
Out[210]:
<seaborn.axisgrid.FacetGrid at 0x25b7047f110>
In [234]:
# overwrite df with its exploded form: from here on each row is a (post, hashtag) pair
df = df.explode('Hashtags')
In [262]:
top_15_hashtags = df['Hashtags'].value_counts().head(15)
In [263]:
top_15_hashtags
Out[263]:
Hashtags
                           119
python                     109
amankharwal                107
machinelearning             96
pythonprogramming           95
datascience                 94
ai                          91
artificialintelligence      89
data                        88
dataanalytics               87
datascientist               83
pythonprojects              82
pythoncode                  78
dataanalysis                77
deeplearning                75
Name: count, dtype: int64
In [261]:
df['Hashtags'].value_counts()[:9].plot(kind='bar')
Out[261]:
<Axes: xlabel='Hashtags'>
In [264]:
plt.figure(figsize=(12,5))
sns.barplot(top_15_hashtags)
plt.xticks(rotation = 10)
plt.show()
In [265]:
plt.figure(figsize=(12,5))
sns.barplot(top_15_hashtags,orient='h')
plt.show()
In [269]:
# mean engagement rate per hashtag, highest first
Hash_avg_rate = df.groupby('Hashtags')['engagement rate'].mean().sort_values(ascending=False)
# reuse the sorted result instead of repeating the groupby
Hash_avg_rate15 = Hash_avg_rate.head(15)
In [267]:
Hash_avg_rate
Out[267]:
Hashtags
sql                  13.638220
mysql                13.638220
roadmap              11.158650
covid                11.027272
healthcare           11.027272
                       ...    
programmingmemes      5.215772
php                   5.215772
programmers           5.215772
webdesign             5.215772
facebook              5.129651
Name: engagement rate, Length: 176, dtype: float64
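A per-hashtag mean says nothing about how many posts it is based on; the identical values for `sql` and `mysql` above suggest those two tags may come from the same post or posts. One way to guard against this, sketched on a hypothetical exploded frame, is to compute the count alongside the mean and filter on it. `MIN_POSTS` is an assumed cut-off chosen only for the example.

```python
import pandas as pd

# Hypothetical exploded frame: one row per (post, hashtag) pair.
exploded = pd.DataFrame({
    "Hashtags": ["sql", "python", "python", "python", "data", "data"],
    "engagement rate": [13.6, 6.0, 8.0, 10.0, 5.0, 7.0],
})

MIN_POSTS = 2  # assumed cut-off, for illustration only

# Mean engagement rate and number of rows per hashtag, highest mean first.
stats = (
    exploded.groupby("Hashtags")["engagement rate"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

# Keep only hashtags backed by at least MIN_POSTS rows.
reliable = stats[stats["count"] >= MIN_POSTS]
```

In this toy data `sql` has the highest mean but appears only once, so it drops out of `reliable`.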
In [270]:
plt.figure(figsize=(12,5))
sns.barplot(Hash_avg_rate15)
plt.xticks(rotation = 45)
plt.show()
In [271]:
px.bar(top_15_hashtags)
In [275]:
px.scatter(df,x='Follows',y = 'Impressions',trendline='ols',color='engagement rate')
In [276]:
# note: df is exploded at this point, so each post is counted once per hashtag
corr_matrix = df[['engagement rate','Impressions','Follows','Likes']].corr()
In [277]:
corr_matrix
Out[277]:
                 engagement rate  Impressions   Follows     Likes
engagement rate         1.000000     0.366108  0.422574  0.595362
Impressions             0.366108     1.000000  0.884286  0.856445
Follows                 0.422574     0.884286  1.000000  0.736817
Likes                   0.595362     0.856445  0.736817  1.000000
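Because `df` was exploded before this cell ran, each post contributes one row per hashtag, so posts with more hashtags weigh more heavily in these correlations. If per-post correlations are wanted instead, they can be computed on a frame with one row per post. The sketch below uses hypothetical per-post numbers (`toy_posts` is made up) purely to show the call.

```python
import pandas as pd

# Hypothetical per-post data: one row per post, before any explode.
toy_posts = pd.DataFrame({
    "Impressions": [3920, 5394, 4021, 4528],
    "Likes": [162, 224, 131, 213],
    "Saves": [98, 194, 41, 172],
})

# Pairwise Pearson correlation, which is the pandas default method.
per_post_corr = toy_posts.corr()

# Single coefficients can be read off by label.
impressions_vs_likes = per_post_corr.loc["Impressions", "Likes"]
impressions_vs_saves = per_post_corr.loc["Impressions", "Saves"]
```

In this notebook, the unexploded copy saved earlier as `df_copy` would be a natural starting point, although it lacks the engineered engagement rate column.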
In [278]:
sns.heatmap(corr_matrix,annot=True,cmap='coolwarm')
Out[278]:
<Axes: >
In [279]:
px.pie(df,names='reach class')
In [282]:
px.pie(names = top_15_hashtags.index , values=top_15_hashtags.values ,title='Top15 Hashtags')